Abstract


We study the estimation error of constrained M-estimators, and derive explicit upper bounds on the expected estimation error determined by the Gaussian width of the constraint set. We consider both the case where the true parameter is on the boundary of the constraint set (matched constraint), and the case where the true parameter is strictly in the constraint set (mismatched constraint). For both cases, we derive novel universal estimation error bounds for regression in a generalized linear model with the canonical link function. Our error bound for the mismatched constraint case is minimax optimal in terms of its dependence on the sample size, for Gaussian linear regression by the Lasso.

1 Introduction 

Consider a general statistical estimation problem. Let $(y_1, \ldots, y_n)$ be a sample following a probability distribution $P_{\theta^\natural}$ in a given class $\mathcal{P} := \{ P_\theta : \theta \in \mathbb{R}^p \}$. We are interested in estimating the parameter $\theta^\natural$, given $(y_1, \ldots, y_n)$ and $\mathcal{P}$, under the high-dimensional setting where $n < p$.

If $\theta^\natural$ is known to satisfy $g(\theta^\natural) \leq c$ for some continuous convex function $g$ and positive constant $c$, we can consider a constrained M-estimator of the form

\[ \hat\theta \in \arg\min \{ f_n(\theta) : \theta \in \mathcal{G} \}, \qquad \mathcal{G} := \{ \theta \in \mathbb{R}^p : g(\theta) \leq c \}. \tag{1} \]


We assume that $f_n$ is a continuously differentiable convex function, and the constraint set $\mathcal{G}$ is non-empty. For example, the Lasso [32] corresponds to



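\[ f_n(\theta) := \frac{1}{2n} \sum_{i=1}^n \left( y_i - \langle a_i, \theta \rangle \right)^2, \qquad g(\theta) := \|\theta\|_1, \tag{2} \]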


for some $a_1, \ldots, a_n \in \mathbb{R}^p$ and positive constant $c$. A matrix $\Theta \in \mathbb{R}^{d \times d}$ can be vectorized as a corresponding vector $\theta \in \mathbb{R}^p$, $p = d^2$. In the low-rank matrix recovery problem [ads], a popular estimator corresponds to



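\[ f_n(\Theta) := \frac{1}{2n} \sum_{i=1}^n \left( y_i - \operatorname{Tr}(A_i^{\mathsf T} \Theta) \right)^2, \qquad g(\Theta) := \|\Theta\|_*, \tag{3} \]
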
for some $A_1, \ldots, A_n \in \mathbb{R}^{d \times d}$ and positive constant $c$, where $\|\cdot\|_*$ denotes the nuclear norm. In general, $f_n$ can be the normalized negative log-likelihood function, or any properly defined function, and $g$ depends on the a priori information on the structure of the parameter $\theta^\natural$ [J |H1 [H].

One can also consider a penalized M-estimator, given by

\[ \hat\theta_{\text{penalized}} \in \arg\min \{ f_n(\theta) + \rho_n g(\theta) \}, \tag{4} \]

for some positive constant $\rho_n$. The penalized M-estimator can be computed by fast proximal methods, provided that the proximal mapping of $g$ is easy to compute [ana. This condition, however, is not always satisfied. For example, if $g$ is the nuclear norm, computing the corresponding proximal mapping requires a full singular value decomposition (SVD) in the first few iterations, and hence does not scale with the parameter dimension. In contrast, if we consider a constrained M-estimator and compute it by the Frank-Wolfe algorithm, each iteration of the algorithm requires a linear minimization oracle (LMO), which can be approximated efficiently by Lanczos' algorithm m. The paper [33] also shows that when $g$ is a structured sparsity regularizer, the LMO can be much easier to compute than the proximal mapping.
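
To make the role of the LMO concrete, the following is a minimal sketch of the Frank-Wolfe iteration for the $\ell_1$-constrained least-squares problem, assuming $f_n(\theta) = \frac{1}{2n}\|y - A\theta\|_2^2$ and $g = \|\cdot\|_1$. The function names `lmo_l1_ball` and `frank_wolfe_lasso`, the open-loop step size $2/(k+2)$, and the synthetic data are illustrative assumptions rather than anything specified in this paper. For the $\ell_1$-ball, the LMO only needs the gradient coordinate of largest magnitude.

```python
import numpy as np

def lmo_l1_ball(grad, c):
    """LMO for the l1-ball: a minimizer of <grad, s> over {s : ||s||_1 <= c}."""
    j = int(np.argmax(np.abs(grad)))
    s = np.zeros_like(grad)
    s[j] = -c * np.sign(grad[j])  # a signed, scaled coordinate vector (vertex of the ball)
    return s

def frank_wolfe_lasso(A, y, c, num_iters=500):
    """Approximately minimize (1/(2n)) ||y - A theta||_2^2 subject to ||theta||_1 <= c."""
    n, p = A.shape
    theta = np.zeros(p)  # the origin is feasible
    for k in range(num_iters):
        grad = A.T @ (A @ theta - y) / n
        s = lmo_l1_ball(grad, c)
        step = 2.0 / (k + 2.0)  # assumed open-loop step size
        theta = (1.0 - step) * theta + step * s  # convex combination stays feasible
    return theta

# Synthetic high-dimensional example (n < p) with a matched constraint.
rng = np.random.default_rng(0)
n, p = 50, 200
theta_true = np.zeros(p)
theta_true[:3] = [1.0, -2.0, 0.5]
A = rng.standard_normal((n, p))
y = A @ theta_true + 0.1 * rng.standard_normal(n)
c = np.abs(theta_true).sum()
theta_hat = frank_wolfe_lasso(A, y, c)
print(np.abs(theta_hat).sum(), np.linalg.norm(theta_hat - theta_true))
```

Each iterate is a convex combination of points in the $\ell_1$-ball, so no projection or proximal step is needed.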

If we consider a constrained M-estimator, setting the value of the constant $c$ in (1) becomes a practical issue. For the case $c < g(\theta^\natural)$, the estimation error is obviously bounded below by the distance between $\theta^\natural$ and the constraint set $\mathcal{G}$, and hence estimation consistency is impossible. Ideally we would like to set $c = g(\theta^\natural)$, while in practice $g(\theta^\natural)$ is seldom known. The last case is when we have some estimate of $g(\theta^\natural)$, and choose $c$ such that $c > g(\theta^\natural)$. Some natural questions arise: Is estimation consistency possible? How fast will the estimation error decay with the sample size $n$? Does setting $c > g(\theta^\natural)$ result in larger estimation error than setting $c = g(\theta^\natural)$? We review related works in Section 2, which shows that answers existed only for specific cases even when $c = g(\theta^\natural)$.

In this paper, we provide a unified analysis for constrained M-estimators. Specifically,

• We propose an elementary framework for analyzing any M-estimator applied to any statistical model in Section 3.

• We obtain universal error bounds in terms of the Gaussian width, valid for all canonical GLMs. We consider the matched constraint case ($c = g(\theta^\natural)$) in Section 4 and the mismatched constraint case ($c > g(\theta^\natural)$) in Section 5.

• To illustrate the universal error bounds, we specialize them to Gaussian linear regression with an arbitrary convex constraint, and to regression in canonical GLMs with the $\ell_1$-constraint, in Section 6, and obtain explicit results.

• Our error bound for the Lasso applied to the Gaussian linear model is optimal in the minimax sense (cf. Section 7).
Existing results for penalized M-estimators [H SI O [TH SSI IHl 155], which are for deterministic $\rho_n$'s, cannot directly recover our results, and vice versa. Indeed, by Lagrange duality, there exists some $\rho_n > 0$ such that the constrained M-estimator in (1) is equivalent to the penalized M-estimator in (4). This correspondence, however, holds only for a given realization of the sample $(y_1, \ldots, y_n)$, and hence $\rho_n$ is a random variable depending on the sample. Conversely, for any penalized M-estimator $\hat\theta_{\text{penalized}}$ for some $\rho_n > 0$, there exists a constant $c = g(\hat\theta_{\text{penalized}})$ such that the corresponding constrained M-estimator (1) is equivalent to $\hat\theta_{\text{penalized}}$. Note that $c = g(\hat\theta_{\text{penalized}})$ is again a random variable and dependent on the sample. We are not aware of any existing work on characterizing the correspondence between the two formulations.


2 Related Works 

In [23, 24], the authors derived sharp estimation error bounds for regression in the linear model by constrained least squares (LS) estimators. The analysis in [38] provides a minimax estimation error bound for the same setting. There are some related works on learning a function in a function class [mill]. When the function class is linearly parametrized by vectors in $\mathbb{R}^p$ and the function corresponding to $\theta^\natural$ is in the function class, the $L_2$-estimation error in the function class may be translated into the $\ell_2$-estimation error with respect to $\theta^\natural$. A common limitation of [HI m [Ml m 135] is that the results are not extendable to general non-linear statistical models.

Another research direction considers constrained estimation in possibly non-linear statistical models [MllllllI]. A constrained M-estimator for logistic
regression was proposed and analyzed in [M] • In the authors proposed and 
analyzed a universal projection-based estimator for regression in generalized 
linear models (GLMs). In [M], the authors analyzed the performance of the 
constrained LS estimator in GLMs. A common limitation of [Ml Hi HZ] is that 
the results are valid only for the specific proposed estimators, and they do not 
even apply to the constrained maximum-likelihood (ML) estimator, which is 
the most popular approach in practice. Moreover, the proposed estimators in 
IMIIMIIIZ] can only recover the true parameter up to a scale ambiguity. 

We say that the constraint is matched if $\theta^\natural$ lies on the boundary of $\mathcal{G}$ in (1) (or $c = g(\theta^\natural)$), and mismatched if $\theta^\natural$ lies strictly in $\mathcal{G}$ (or $c > g(\theta^\natural)$). The analyses in [23, 24] require the constraint to be matched, while in practice the exact value of $g(\theta^\natural)$ is seldom known. The constraint in m is always matched due to the special structure of quantum density operators. The error bounds in IM1I31] can be overly pessimistic, because they hold for all $\theta^\natural \in \mathcal{G}$. The results in miMiiiz] do not require a matched constraint and depend on $\theta^\natural$; our result is of this kind. Recall that, however, [T5] is limited to specific statistical models, and [MIHZ] are limited to specific M-estimators.
3 A Geometric Framework


3.1 Basic Idea 


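To illustrate the basic idea of our framework, let us start with a simple setting, where $f_n$ is strongly convex with parameter $\mu > 0$, i.e.,

\[ \langle \nabla f_n(y) - \nabla f_n(x), y - x \rangle \geq \mu \|y - x\|_2^2, \]

for any $x, y \in \operatorname{dom} f_n$. Note that then $\hat\theta$ is uniquely defined.

Define $\iota_{\mathcal{G}} : \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ as the indicator function of the constraint set $\mathcal{G}$; that is, $\iota_{\mathcal{G}}(\theta) = 0$ if $\theta \in \mathcal{G}$, and $\iota_{\mathcal{G}}(\theta) = +\infty$ otherwise. By the strong convexity of $f_n$, we have

\[ \langle \nabla f_n(\hat\theta) - \nabla f_n(\theta^\natural), \hat\theta - \theta^\natural \rangle \geq \mu \|\hat\theta - \theta^\natural\|_2^2. \tag{5} \]

By the convexity of $\iota_{\mathcal{G}}$, or the monotonicity of the subdifferential mapping, we have

\[ \langle \hat z - z^\natural, \hat\theta - \theta^\natural \rangle \geq 0, \tag{6} \]

for any $\hat z \in \partial\iota_{\mathcal{G}}(\hat\theta)$, and any $z^\natural \in \partial\iota_{\mathcal{G}}(\theta^\natural)$. Summing up (5) and (6), we obtain

\[ \langle \nabla f_n(\hat\theta) + \hat z - \nabla f_n(\theta^\natural) - z^\natural, \hat\theta - \theta^\natural \rangle \geq \mu \|\hat\theta - \theta^\natural\|_2^2 \]

for any $\hat z \in \partial\iota_{\mathcal{G}}(\hat\theta)$. By the optimality condition of $\hat\theta$, there exists some $\hat z \in \partial\iota_{\mathcal{G}}(\hat\theta)$ such that

\[ 0 = \nabla f_n(\hat\theta) + \hat z, \tag{7} \]

and hence we have

\[ \langle -\nabla f_n(\theta^\natural) - z^\natural, \hat\theta - \theta^\natural \rangle \geq \mu \|\hat\theta - \theta^\natural\|_2^2 \]

for any $z^\natural \in \partial\iota_{\mathcal{G}}(\theta^\natural)$. Since $\partial\iota_{\mathcal{G}}(\theta^\natural)$ is always a closed convex cone, we can choose $z^\natural = 0$ and obtain

\[ \langle -\nabla f_n(\theta^\natural), \hat\theta - \theta^\natural \rangle \geq \mu \|\hat\theta - \theta^\natural\|_2^2. \tag{8} \]

Applying the Cauchy-Schwarz inequality to the left-hand side, we obtain

\[ \|\nabla f_n(\theta^\natural)\|_2 \, \|\hat\theta - \theta^\natural\|_2 \geq \mu \|\hat\theta - \theta^\natural\|_2^2, \]

or

\[ \|\hat\theta - \theta^\natural\|_2 \leq \frac{1}{\mu} \|\nabla f_n(\theta^\natural)\|_2. \tag{9} \]

Taking expectations on both sides, we immediately obtain the following estimation error bound:

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq \frac{1}{\mu} \mathbb{E} \|\nabla f_n(\theta^\natural)\|_2. \tag{10} \]
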
The gradient at the true parameter, $\nabla f_n(\theta^\natural)$, usually concentrates around 0 with high probability.
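
As a sanity check of the sure bound (9), the sketch below evaluates both sides in the simplest strongly convex setting: least squares with $n > p$, assuming the constraint is inactive (that is, $c$ is taken large enough that $\mathcal{G}$ contains the unconstrained minimizer). The data model, the seed, and all variable names are illustrative assumptions. Here $\mu$ is the smallest eigenvalue of $A^{\mathsf T} A / n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 200, 10, 0.5
A = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = A @ theta_true + sigma * rng.standard_normal(n)

# f_n(theta) = (1/(2n)) ||y - A theta||_2^2, minimized without an active constraint.
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

hessian = A.T @ A / n
mu = np.linalg.eigvalsh(hessian)[0]          # strong convexity parameter (smallest eigenvalue)
grad_true = A.T @ (A @ theta_true - y) / n   # gradient of f_n at the true parameter

error = np.linalg.norm(theta_hat - theta_true)
bound = np.linalg.norm(grad_true) / mu
print(f"error = {error:.4f}, bound = {bound:.4f}")
```

In this setting the inequality holds deterministically, because $\hat\theta - \theta^\natural = -(A^{\mathsf T} A / n)^{-1} \nabla f_n(\theta^\natural)$.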

The simple error bound (10) is not desirable for two reasons:

1. In the high-dimensional setting where $n < p$, $f_n$ cannot be strongly convex even for the basic LS estimator.

2. It does not depend on the choice of $g$.

We address the first issue in Section 3.2 and the second issue in Section 3.3.

3.2 Restricted Strong Convexity 

Note that in order to facilitate the arguments in the previous sub-section, we only require (5) to hold for $\hat\theta$ and $\theta^\natural$, instead of any two vectors in $\mathbb{R}^p$. Therefore, we only need $f_n$ to satisfy some restricted notion of strong convexity. Similar (but not exactly the same) ideas appeared in [8], [21], and can be traced back to m [33].

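Definition 3.1 (Feasible Set and Feasible Cone). The feasible set of $g$ at $\theta^\natural$, denoted by $\mathcal{F}_g(\theta^\natural)$, is given by

\[ \mathcal{F}_g(\theta^\natural) := \mathcal{G} - \theta^\natural = \{ \theta - \theta^\natural : \theta \in \mathcal{G} \}. \]

The feasible cone of $g$ at $\theta^\natural$, denoted by $\overline{\mathcal{F}}_g(\theta^\natural)$, is defined as the conic hull of $\mathcal{F}_g(\theta^\natural)$.

By the definition of $\hat\theta$, the estimation error must satisfy $\hat\theta - \theta^\natural \in \mathcal{F}_g(\theta^\natural)$.

Definition 3.2 (Restricted Strong Convexity). The function $f_n$ satisfies the restricted strong convexity (RSC) condition with parameter $\mu > 0$ if

\[ \langle \nabla f_n(\theta^\natural + e) - \nabla f_n(\theta^\natural), e \rangle \geq \mu \|e\|_2^2, \tag{11} \]

for any $e \in \mathcal{F}_g(\theta^\natural)$.

If $f_n$ is twice continuously differentiable, we have a sufficient condition.

Proposition 3.1. The function $f_n$ satisfies the RSC condition with parameter $\mu > 0$ if

\[ \langle e, \nabla^2 f_n(\theta^\natural + \lambda e) \, e \rangle \geq \mu \|e\|_2^2, \]

for all $\lambda \in [0, 1]$ and all $e \in \mathcal{F}_g(\theta^\natural)$.

The uniqueness of $\hat\theta$ and the derivation of the error bound in Section 3.1 are still valid even when $n < p$, as long as $f_n$ satisfies the RSC condition with some parameter $\mu > 0$.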
3.3 Refined Error Bound


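We address the dependence of the estimation error on the choice of $g$, and derive a refined error bound in this sub-section.

We note that

\[ \langle -\nabla f_n(\theta^\natural), \hat\theta - \theta^\natural \rangle \leq \left\| \Pi_{\overline{\hat\theta - \theta^\natural}} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \|\hat\theta - \theta^\natural\|_2, \]

where $\Pi_{\overline{\hat\theta - \theta^\natural}}(\cdot)$ denotes the projection onto the conic hull of $\{\hat\theta - \theta^\natural\}$ (which is a half-line or $\{0\}$). This implies, by (8),

\[ \left\| \Pi_{\overline{\hat\theta - \theta^\natural}} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \geq \mu \|\hat\theta - \theta^\natural\|_2. \]

The left-hand side, however, is not tractable due to its dependence on $\hat\theta$. As $\hat\theta - \theta^\natural \in \mathcal{F}_g(\theta^\natural)$ by definition, we consider a looser bound:

\[ \left\| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \geq \mu \|\hat\theta - \theta^\natural\|_2, \tag{12} \]

where $\Pi_{\overline{\mathcal{F}}_g(\theta^\natural)}(\cdot)$ denotes projection onto the feasible cone $\overline{\mathcal{F}}_g(\theta^\natural)$.

Taking expectations on both sides, we obtain the following lemma.

Lemma 3.2. Assume that $f_n$ satisfies the RSC condition with parameter $\mu > 0$. Then $\hat\theta$ is uniquely defined, and satisfies

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq \frac{1}{\mu} \mathbb{E} \left\| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2. \]
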

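Since $-\nabla f_n(\theta^\natural)$ is a descent direction of $f_n$, if its direction is coherent with the feasible cone $\overline{\mathcal{F}}_g(\theta^\natural)$, we may find some point $\theta'$ far away from $\theta^\natural$, with $\theta' - \theta^\natural$ in the feasible set $\mathcal{F}_g(\theta^\natural)$, such that $f_n(\theta')$ is much smaller than $f_n(\theta^\natural)$, and hence the estimation error can be large. This provides an intuitive interpretation of the lemma.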

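Since projection onto a closed convex set is a non-expansive mapping, we have

\[ \left\| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \leq \|\nabla f_n(\theta^\natural)\|_2, \]

so the error bound is always no larger than the one in Section 3.1. Lemma 3.2 is the theoretical foundation of the rest of this paper.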
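
The two facts used above, namely that projecting onto a closed convex cone cannot increase the norm and that the projected gradient still controls inner products with directions in the cone, can be checked numerically. The sketch below uses the nonnegative orthant as a stand-in for the feasible cone; this choice of cone, the function name `project_nonneg_orthant`, and the random vectors are assumptions made purely for illustration.

```python
import numpy as np

def project_nonneg_orthant(u):
    """Euclidean projection onto the closed convex cone {v : v >= 0}."""
    return np.maximum(u, 0.0)

rng = np.random.default_rng(2)
p = 20
u = rng.standard_normal(p)           # plays the role of -grad f_n(theta_true)
v = np.abs(rng.standard_normal(p))   # an arbitrary direction inside the cone

proj_norm = np.linalg.norm(project_nonneg_orthant(u))
full_norm = np.linalg.norm(u)
inner = float(u @ v)

# Non-expansiveness: the refined quantity never exceeds the crude one.
print(proj_norm <= full_norm)
# The projection still dominates inner products with directions in the cone.
print(inner <= proj_norm * np.linalg.norm(v) + 1e-12)
```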


4 Estimation Error Bound in Terms of the Gaussian Width

We apply Lemma 3.2 to constrained ML estimators in a GLM with the canonical link function. Examples of a canonical GLM include the Gaussian linear, logistic, gamma, and Poisson regression models.


Let $\theta^\natural \in \mathbb{R}^p$ be the parameter to be estimated, or the unknown vector of regression coefficients. In a canonical GLM, the negative log-likelihood of a sample $y_i$, given $\theta^\natural$, is of the form (up to scaling and shifting by some constants)

\[ \ell(y_i; \theta^\natural) = y_i \langle a_i, \theta^\natural \rangle - b(\langle a_i, \theta^\natural \rangle), \]

where $a_1, \ldots, a_n \in \mathbb{R}^p$ are given, and we assume that $b$ is some given concave function. Let $(y_1, \ldots, y_n) \in \mathbb{R}^n$ be the sample. The constrained ML estimator is given by (1) with

\[ f_n(\theta) := \frac{1}{n} \sum_{i=1}^n \ell(y_i; \theta), \tag{13} \]

and $g$ being some continuous convex function. For simplicity, we consider the case where $c = g(\theta^\natural)$ in this section; we address the case where $c > g(\theta^\natural)$ in Section 5.

We specialize Lemma 3.2 to the canonical GLM and obtain the following theorem.


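Definition 4.1 (Gaussian width [8l [Ill [33]). Let $\mathcal{C} \subseteq \mathbb{R}^p$. The Gaussian width of $\mathcal{C}$ is given by

\[ \omega_t(\mathcal{C}) := \mathbb{E} \sup_{v \in \mathcal{C} \cap t S^{p-1}} \langle h, v \rangle, \]

where $h := (h_1, \ldots, h_p)$ is a vector of i.i.d. standard Gaussian random variables, and $S^{p-1}$ denotes the unit $\ell_2$-sphere in $\mathbb{R}^p$.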
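
For intuition, the Gaussian width can be estimated by Monte Carlo whenever the supremum has a closed form. The sketch below takes $\mathcal{C}$ to be a $k$-dimensional coordinate subspace, for which the supremum over unit vectors equals the $\ell_2$-norm of the first $k$ coordinates of $h$, so that $\omega_1(\mathcal{C}) = \mathbb{E}\|h_{1:k}\|_2$, which lies between $k/\sqrt{k+1}$ and $\sqrt{k}$. The choice of set, the function name `gaussian_width_subspace`, and the sample size are assumptions for illustration only.

```python
import numpy as np

def gaussian_width_subspace(p, k, num_samples=20000, seed=0):
    """Monte Carlo estimate of omega_1(C) for C = span(e_1, ..., e_k) in R^p."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((num_samples, p))
    # sup over unit vectors supported on the first k coordinates of <h, v>
    # is attained at v = h[:k] / ||h[:k]||_2 and equals ||h[:k]||_2.
    return float(np.linalg.norm(h[:, :k], axis=1).mean())

width = gaussian_width_subspace(p=100, k=16)
print(width, np.sqrt(16))  # the estimate should be slightly below sqrt(k)
```

The width depends on the dimension $k$ of the set rather than on the ambient dimension $p$, which is what makes bounds stated in terms of it useful when $n < p$.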


Theorem 4.1. Consider the canonical GLM and the corresponding ML estimator described above for $c = g(\theta^\natural)$. Assume that the entries of $a_1, \ldots, a_n$ are either all i.i.d. standard Gaussian or all i.i.d. Rademacher random variables (random variables taking values in $\{+1, -1\}$ with equal probability), and $f_n$ satisfies the RSC condition for $\mu > 0$ with probability at least $1/2$. Then

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1(\overline{\mathcal{F}}_g(\theta^\natural))}{\mu \sqrt{n}}, \]

where $\sigma_{\max} := \max_i \sqrt{\operatorname{var} y_i}$.

Remark. Note that the expectation is with respect to $A$ (the matrix whose $i$-th row is $a_i$) and $\varepsilon$, conditioned on the event that the RSC condition holds.

The feasible cone $\overline{\mathcal{F}}_g(\theta^\natural)$ coincides with the tangent cone of $g$ at $\theta^\natural$ defined in [8]. Therefore, to evaluate the estimation error bound, we only need to evaluate the Gaussian width of the corresponding tangent cone. We note that there are already many results for a variety of commonly used regularization functions, such as the $\ell_1$-norm, nuclear norm, total variation semi-norm, and general atomic norms [i IHKIIl HSilSnilSHj. Therefore, for most of the applications, we only need to plug in an existing bound on the Gaussian width.

Finally, we would like to emphasize that the Gaussian width in Theorem 4.1 comes from bounding the random process induced by the random gradient $\nabla f_n(\theta^\natural)$ (cf. the proof of Theorem 4.1), instead of being a consequence of applying Gordon's Lemma. That is, our result is essentially different from those in [SIESIEI].
5 Effect of a Mismatched Constraint


In this section, we discuss the effect of a mismatched constraint for ML regression in a canonical GLM. Recall that the constraint set $\mathcal{G}$ is called mismatched if $c > g(\theta^\natural)$ in (1).

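The notion of the RSC in Definition 3.2 is no longer meaningful when the constraint set is mismatched. Take ML regression in the Gaussian linear model for example, for which the corresponding $f_n$ is given in (2). Let $A \in \mathbb{R}^{n \times p}$ be defined as in Theorem 4.1. The RSC condition requires

\[ \langle \nabla f_n(\theta^\natural + e) - \nabla f_n(\theta^\natural), e \rangle = \frac{1}{n} \|Ae\|_2^2 \geq \mu \|e\|_2^2, \]

for some $\mu > 0$ and all $e \in \overline{\mathcal{F}}_g(\theta^\natural)$, where we say $e \in \overline{\mathcal{F}}_g(\theta^\natural)$ instead of $e \in \mathcal{F}_g(\theta^\natural)$ because $A$ is a linear operator. Since when the constraint is mismatched, $\overline{\mathcal{F}}_g(\theta^\natural)$ is the whole space $\mathbb{R}^p$, the RSC condition requires $A$ to be a non-singular matrix (that is, to have a trivial null space). This cannot be true in the high-dimensional setting, where $A \in \mathbb{R}^{n \times p}$ and $n < p$.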
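
The failure of unrestricted strong convexity when $n < p$ is easy to observe numerically: any $A \in \mathbb{R}^{n \times p}$ with $n < p$ has a non-trivial null space, so there is a unit direction $e$ with $\frac{1}{n}\|Ae\|_2^2 = 0$. The sketch below extracts such a direction from the singular value decomposition; the dimensions and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 50
A = rng.standard_normal((n, p))

# With full_matrices=True, Vt is p x p; its last p - n rows span the null space of A.
_, _, Vt = np.linalg.svd(A, full_matrices=True)
e = Vt[-1]                                   # unit vector with A e = 0 (up to round-off)

curvature = np.linalg.norm(A @ e) ** 2 / n   # (1/n) ||A e||_2^2
print(np.linalg.norm(e), curvature)
```

No $\mu > 0$ can satisfy the RSC inequality along this direction, which is why the analysis restricts attention to error vectors outside the ball $tB$.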

Our Approach: Let $t > 0$ and denote by $B$ the unit $\ell_2$-ball in $\mathbb{R}^p$. We partition the feasible set $\mathcal{F}_g(\theta^\natural)$ as

\[ \mathcal{F}_g(\theta^\natural) = \left( \mathcal{F}_g(\theta^\natural) \cap tB \right) \cup \left( \mathcal{F}_g(\theta^\natural) \setminus tB \right). \]

When $t$ is large enough, the conic hull of $\mathcal{F}_g(\theta^\natural) \setminus tB$ will not be the whole space $\mathbb{R}^p$, so it is possible to have restricted strong convexity on $\mathcal{F}_g(\theta^\natural) \setminus tB$ when $n < p$. If the error vector $\hat\theta - \theta^\natural$ lies in $\mathcal{F}_g(\theta^\natural) \setminus tB$, we can obtain an error bound, say, $\tilde t$, as in Section 4; otherwise, if the error vector lies in $tB$, a naive error bound is the radius of the ball, i.e., $t$. Finally, we can bound the estimation error from above by the maximum of $t$ and $\tilde t$. Note that $\tilde t$ is implicitly dependent on $t$.

The arguments in the previous paragraph can be made precise as in Lemma 5.1, which is an analogue of Lemma 3.2 in the mismatched case. Lemma 5.1 holds for arbitrary constrained M-estimators of the form (1) and statistical models.


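Lemma 5.1. Suppose that for some $t > 0$, we have

\[ \langle \nabla f_n(\theta^\natural + e) - \nabla f_n(\theta^\natural), e \rangle \geq \mu \|e\|_2^2, \tag{14} \]

for some $\mu > 0$ and all $e \in \mathcal{F}_g(\theta^\natural) \setminus tB$. Then

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq t + \frac{1}{\mu} \mathbb{E} \left\| \Pi_{\overline{\mathcal{F}_g(\theta^\natural) \setminus tB}} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2. \]

We can also prove an analogue of Theorem 4.1 for constrained ML regression in a canonical GLM.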


Corollary 5.2. Consider the canonical GLM and the corresponding ML estimator described in Section 4 for $c > g(\theta^\natural)$. Let $A$ be defined as in Theorem 4.1, and let $t > 0$. Suppose that (14) holds true for some $\mu > 0$ with probability at least $1/2$. Then we have

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq t + 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\mu \sqrt{n}}, \]

where $\sigma_{\max}$ is defined as in Theorem 4.1.

The proofs of Lemma 5.1 and Corollary 5.2 are similar to the proofs of Lemma 3.2 and Theorem 4.1, respectively.


6 Applications 


Once the conditions (11) and (14) are verified, our results Theorem 4.1 and Corollary 5.2 immediately follow. We explicitly verify the conditions for two applications and obtain the corresponding estimation error bounds.

The first application is regression by the constrained LS estimator in a Gaussian linear model. Let $\theta^\natural \in \mathbb{R}^p$ and $a_1, \ldots, a_n$ be vectors in $\mathbb{R}^p$. The sample is given by

\[ y_i = \langle a_i, \theta^\natural \rangle + \sigma w_i, \qquad i = 1, \ldots, n, \]

for some $\sigma > 0$, where $w_1, \ldots, w_n$ are i.i.d. standard Gaussian random variables. We consider the constrained LS estimator, for which $f_n$ is given by (2), and $\mathcal{G} := \{ \theta : g(\theta) \leq c \}$ for some $c \geq g(\theta^\natural)$, where $g$ can be any convex continuous function.


Corollary 6.1. Consider the Gaussian linear model and the constrained LS estimator described above. Assume that the entries of $a_1, \ldots, a_n$ are either all i.i.d. standard Gaussian or all i.i.d. Rademacher random variables. Let $\epsilon \in (0, 1)$. For any $t > 0$, there exist positive constants $c_1$ and $c_2$ such that if

\[ \sqrt{n} \geq \frac{c_1 \, \omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\epsilon}, \tag{15} \]

then we have

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq t + 2\sqrt{2\pi} \, \sigma \, \frac{\omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{(1 - \epsilon) \sqrt{n}}, \tag{16} \]

with probability at least $1 - \exp(-c_2 \epsilon^2 n) \geq 1/2$ when $n$ is large enough.

Remark. When the constraint is matched, we can simply set $t = 0$. Recall that $t$ cannot be zero for the mismatched constraint case when $n < p$ (cf. Section 5). This remark also applies to Corollary 6.2 below.

Remark. For the mismatched constraint case, Corollary 6.1 is minimax optimal for the Lasso in the Gaussian linear model. We address this in Section 7.

Corollary 6.1 is consistent with [^. The result in [24] is sharper, while Corollary 6.1 is more general as it also covers the mismatched constraint case.

The second application is $\ell_1$-constrained ML regression in a canonical GLM.
Corollary 6.2. Consider the canonical GLM and the constrained ML estimator described in Section 4 for $g(\theta) := \|\theta\|_1$ and $c \geq \|\theta^\natural\|_1$. Assume that $f_n$ in (13) is twice continuously differentiable, and the entries of $a_1, \ldots, a_n$ are i.i.d. Rademacher random variables. Let $\epsilon \in (0, 1)$. For any $t > 0$, there exist positive constants $c_1$, $\nu$, and $c_2$ such that if (15) is satisfied, then we have

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq t + 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\nu (1 - \epsilon) \sqrt{n}}, \tag{17} \]

with probability at least $1 - \exp(-c_2 \epsilon^2 n) \geq 1/2$ when $n$ is large enough, where $\sigma_{\max} := \max_i \sqrt{\operatorname{var} y_i}$ is bounded above by a constant independent of $n$.

To the best of our knowledge, there are no existing results for $\ell_1$-constrained ML regression in GLMs. Here we compare Corollary 6.2 with [20], which provides an error bound for $\ell_1$-penalized ML estimators in GLMs. Recall that, however, the correspondence between the constrained and penalized estimators is currently unclear. When the constraint is matched and $\theta^\natural$ is $s$-sparse, Corollary 6.2 states that when $n = \Omega(s \log(p/s))$,

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 = O\left( \sqrt{\frac{s \log(p/s)}{n}} \right) \]

by Proposition 3.10 in [8], which essentially coincides with Corollary 5 in [20]^1. We note that [20] only provides an error bound for the $\ell_1$-penalization case.


7 Sharpness of Our Error Bound 


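It has been shown that in a Gaussian linear model with $\mathcal{G}$ being an $\ell_1$-ball, any estimator $\hat\theta_{\text{arbitrary}}$ must satisfy, with probability larger than $1/2$, a lower bound on the worst-case error

\[ \max_{\theta^\natural \in \mathcal{G}} \| \hat\theta_{\text{arbitrary}} - \theta^\natural \|_2 \]

that decays as $n^{-1/4}$ in the sample size, under some technical conditions m. Now we show that our error bound for the Lasso in Corollary 6.1 actually achieves the error decaying rate $O(n^{-1/4})$ in the mismatched constraint case, and hence cannot be essentially improved.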

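By the definition of the Gaussian width, we have, for any $t > 0$,

\[ \omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right) = \frac{\omega_t\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{t} = \frac{\omega_t\left( \mathcal{F}_g(\theta^\natural) \right)}{t}, \]

and hence the estimation error bound in Corollary 6.1 can be written as

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq t + \frac{C \, \omega_t(\mathcal{F}_g(\theta^\natural))}{t \sqrt{n}} \tag{18} \]

for some $C > 0$, when $n$ is large enough such that (15) is satisfied.

1 We cite m instead of the published version ED. because the estimation error bound only appears in m.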
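Define the global Gaussian width:

\[ \omega(\mathcal{F}_g(\theta^\natural)) := \mathbb{E} \sup_{x \in \mathcal{F}_g(\theta^\natural)} \langle h, x \rangle, \]

where $h \in \mathbb{R}^p$ is a vector of i.i.d. standard Gaussian random variables. By definition, $\omega_t(\mathcal{F}_g(\theta^\natural))$ is bounded above by $\omega(\mathcal{F}_g(\theta^\natural))$, independent of $n$. Replacing $\omega_t(\mathcal{F}_g(\theta^\natural))$ by $\omega(\mathcal{F}_g(\theta^\natural))$ in (18), we have a looser error upper bound:

\[ \mathbb{E} \|\hat\theta - \theta^\natural\|_2 \leq t + \frac{C \, \omega(\mathcal{F}_g(\theta^\natural))}{t \sqrt{n}}. \]

Minimizing this bound over all $t > 0$, we obtain the $O(n^{-1/4})$ error decaying rate. Similar discussion can be found in m.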
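
The minimization over $t$ can be made explicit. Writing the looser bound as $t + K/t$ with $K := C \omega(\mathcal{F}_g(\theta^\natural)) / \sqrt{n}$, the minimizer is $t^\star = \sqrt{K}$ and the minimum is $2\sqrt{K}$, which scales as $n^{-1/4}$. The sketch below confirms this numerically on a grid; the value assigned to `C_omega` is a hypothetical placeholder for $C \omega(\mathcal{F}_g(\theta^\natural))$, and the function name `best_tradeoff` is likewise an assumption.

```python
import numpy as np

def best_tradeoff(K):
    """Minimize t + K / t over a log-spaced grid of t > 0.

    The closed form is t* = sqrt(K) with minimum value 2 * sqrt(K).
    """
    t_grid = np.logspace(-4, 2, 200001)
    values = t_grid + K / t_grid
    i = int(np.argmin(values))
    return float(t_grid[i]), float(values[i])

C_omega = 3.0  # hypothetical value of C * omega(F_g(theta_true))
for n in (10**2, 10**4, 10**6):
    K = C_omega / np.sqrt(n)
    t_star, value = best_tradeoff(K)
    # Last column: the predicted minimum 2 * sqrt(C_omega) * n^(-1/4).
    print(n, t_star, value, 2.0 * np.sqrt(C_omega) * n ** (-0.25))
```

Multiplying $n$ by $100$ divides the optimized bound by $\sqrt{10}$, consistent with the $n^{-1/4}$ rate.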


8 Discussion 

Note that by the elementary argument in Section 3 we arrive at an estimation error bound (12) that holds surely. It is possible to derive a concentration-type error guarantee based on this sure error bound, which we are working on.

Our framework is not restricted to constraint sets of the form (1); it applies to any non-empty closed convex set $\mathcal{G}$, as we only require $\iota_{\mathcal{G}}(\cdot)$ to be proper closed convex in the proof. This observation is crucial to applying our framework to analyze constrained estimators for quantum tomography [13 HH and photon-limited imaging systems d], which we are studying.

In this paper, we consider a random matrix $A$, and discuss the expected estimation error with respect to both $A$ and the sample $(y_1, \ldots, y_n)$. The extension to the case where $A$ is deterministic is technically non-trivial, and we have not obtained a satisfactory result. We address this in the remark following the proof of Theorem 4.1 in the appendix.
A Proof of Proposition 3.1

We have

\[ \langle \nabla f_n(\theta^\natural + e) - \nabla f_n(\theta^\natural), e \rangle = \int_0^1 \langle e, \nabla^2 f_n(\theta^\natural + \lambda e) \, e \rangle \, d\lambda. \]

The right-hand side is at least $\mu \|e\|_2^2$ by assumption.
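B Proof of Theorem 4.1

The main goal of the proof is to evaluate $\mathbb{E} \| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)}(-\nabla f_n(\theta^\natural)) \|_2$. Here the expectation is with respect to both $A$ and the sample $(y_i)_{i=1,\ldots,n}$. We start with an equivalent formulation:

\[ \mathbb{E} \left\| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 = \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle -\nabla f_n(\theta^\natural), v \rangle, \tag{19} \]

where $S^{p-1}$ denotes the unit $\ell_2$-sphere in $\mathbb{R}^p$. It is well known that in a canonical GLM, we have

\[ \nabla f_n(\theta^\natural) = -\frac{1}{n} A^{\mathsf T} \varepsilon, \tag{20} \]

where $\varepsilon := (y_i - \mathbb{E} y_i)_{i=1,\ldots,n}$, and hence

\[ \mathbb{E} \left\| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 = \frac{1}{n} \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \varepsilon, v \rangle. \]

To proceed, we need the following symmetrization inequality. It differs from the well-known symmetrization inequality by a Rademacher process, so we state it here for completeness.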

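Lemma B.1 ([36]). Let $\xi_1, \ldots, \xi_n$ be independent real-valued random variables, and let $\mathcal{F}$ be a class of real functions. We have

\[ \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^n \left[ f(\xi_i) - \mathbb{E} f(\xi_i) \right] \leq \sqrt{2\pi} \, \mathbb{E} \sup_{f \in \mathcal{F}} \sum_{i=1}^n h_i f(\xi_i), \]

where $h_1, \ldots, h_n$ are i.i.d. standard Gaussian random variables.

Remark. In [36], the lemma is stated for the case when $\xi_1, \ldots, \xi_n$ are i.i.d. The case when $\xi_1, \ldots, \xi_n$ are not necessarily identically distributed can be proved in a similar way, as noted in pS].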

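By Lemma B.1, we have

\[ \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \varepsilon, v \rangle = \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle \varepsilon, Av \rangle \leq \sqrt{2\pi} \, \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle h \cdot \varepsilon, Av \rangle, \]

where $h \cdot \varepsilon := (h_i \varepsilon_i)_{i=1,\ldots,n}$ and $h_1, \ldots, h_n$ are i.i.d. standard Gaussian random variables. Note that $h \cdot \varepsilon$ is a random Gaussian vector with zero mean and covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ which is dependent on $A$ in general; moreover, since the entries in $\varepsilon$ are independent, $\Sigma$ is a diagonal matrix with diagonal entries given by $\Sigma_{i,i} := \operatorname{var} \varepsilon_i$. Define $\tilde h := (\tilde h_i)_{i=1,\ldots,n}$, where $\tilde h_i := \Sigma_{i,i}^{-1/2} h_i \varepsilon_i$. Then $\tilde h$ is a vector of i.i.d. standard Gaussian random variables; furthermore, it is still a vector of i.i.d. standard Gaussian random variables conditioned on $A$, and hence it is statistically independent of $A$.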

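Since $h \cdot \varepsilon$ and $\Sigma^{1/2} \tilde h$ have the same probability distribution, we can write

\[ \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle h \cdot \varepsilon, Av \rangle = \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle \Sigma^{1/2} \tilde h, Av \rangle. \]

Let $T := \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}$. Conditioned on any given $A$ (and hence $\Sigma$), we consider two mean-zero Gaussian processes $\{X_t\}_{t \in T}$ and $\{Y_t\}_{t \in T}$ defined as

\[ X_t := \langle \Sigma^{1/2} \tilde h, At \rangle, \qquad Y_t := \sigma_{\max} \langle \tilde h, At \rangle, \]

where $\sigma_{\max} := \max_i \Sigma_{i,i}^{1/2} = \max_i \sqrt{\operatorname{var} \varepsilon_i}$. We have, for any $t_1, t_2 \in T$,

\[ \mathbb{E} |X_{t_1} - X_{t_2}|^2 = \| \Sigma^{1/2} A (t_1 - t_2) \|_2^2 \leq \sigma_{\max}^2 \| A (t_1 - t_2) \|_2^2 = \mathbb{E} |Y_{t_1} - Y_{t_2}|^2. \]

By Slepian's lemma, this implies

\[ \mathbb{E} \sup_{t \in T} X_t \leq \mathbb{E} \sup_{t \in T} Y_t. \]

Since the inequality holds given any realization of $A$, we have

\[ \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \varepsilon, v \rangle \leq \sqrt{2\pi} \, \sigma_{\max} \, \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle \tilde h, Av \rangle = \sqrt{2\pi} \, \sigma_{\max} \, \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \tilde h, v \rangle. \]

It remains to prove

\[ \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \tilde h, v \rangle \leq \sqrt{n} \, \omega_1(\overline{\mathcal{F}}_g(\theta^\natural)) := \sqrt{n} \, \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle h, v \rangle, \tag{21} \]

where, on the right-hand side, $h \in \mathbb{R}^p$ is a vector of i.i.d. standard Gaussian random variables as in Definition 4.1.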

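We consider two cases:

Case 1: If $A$ has i.i.d. standard Gaussian entries, then conditioned on $\tilde h$, $A^{\mathsf T} \tilde h$ is a vector of mean-zero Gaussian random variables with covariance matrix $\|\tilde h\|_2^2 I$, and hence has the same probability distribution as $\|\tilde h\|_2 h$, where $h \in \mathbb{R}^p$ is a vector of i.i.d. standard Gaussian random variables independent of $\tilde h$. Therefore,

\[ \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \tilde h, v \rangle = \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle \|\tilde h\|_2 h, v \rangle = \left( \mathbb{E} \|\tilde h\|_2 \right) \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle h, v \rangle \leq \sqrt{n} \, \omega_1(\overline{\mathcal{F}}_g(\theta^\natural)). \]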
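Case 2: If $A$ has i.i.d. Rademacher entries, then conditioned on $A$, $A^{\mathsf T} \tilde h$ is a vector of mean-zero Gaussian random variables with covariance matrix $nI$, and hence has the same probability distribution as $\sqrt{n} h$, where $h \in \mathbb{R}^p$ is a vector of i.i.d. standard Gaussian random variables. Therefore,

\[ \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle A^{\mathsf T} \tilde h, v \rangle = \mathbb{E} \sup_{v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}} \langle \sqrt{n} h, v \rangle = \sqrt{n} \, \omega_1(\overline{\mathcal{F}}_g(\theta^\natural)). \]

In summary, we obtain

\[ \mathbb{E} \left\| \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \leq \sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1(\overline{\mathcal{F}}_g(\theta^\natural))}{\sqrt{n}}, \]

if the entries of $A$ are i.i.d. standard Gaussian or Rademacher random variables, for a canonical GLM, where the expectation is with respect to both $A$ and the sample $(y_i)_{i=1,\ldots,n}$.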

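Let $\mathcal{E}$ denote the event that the RSC condition holds, and write $\Pi := \Pi_{\overline{\mathcal{F}}_g(\theta^\natural)}$ for brevity. Then we have

\[ \mathbb{E} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 = \mathbb{P}(\mathcal{E}) \, \mathbb{E}_{A, (y_i) \mid \mathcal{E}} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 + \mathbb{P}(\mathcal{E}^c) \, \mathbb{E}_{A, (y_i) \mid \mathcal{E}^c} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2, \]

and hence

\[ \mathbb{E}_{A, (y_i) \mid \mathcal{E}} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \leq \frac{\mathbb{E} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2}{\mathbb{P}(\mathcal{E})} \leq 2 \, \mathbb{E} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2, \]

where we applied the assumption that $\mathbb{P}(\mathcal{E}) \geq 1/2$. By Lemma 3.2, this implies

\[ \mathbb{E}_{A, \varepsilon \mid \mathcal{E}} \|\hat\theta - \theta^\natural\|_2 \leq \frac{1}{\mu} \mathbb{E}_{A, \varepsilon \mid \mathcal{E}} \left\| \Pi \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \leq 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1(\overline{\mathcal{F}}_g(\theta^\natural))}{\mu \sqrt{n}}. \]

This completes the proof.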

Remark. If we want to adapt this proof to the deterministic $A$ case, a technical issue arises when bounding the right-hand side of (EH). As the random process $\{X_v\}_{v \in V}$ with $X_v := \langle \tilde h, Av \rangle$, where $V := \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}$, is a mean-zero Gaussian process, a standard approach is to bound $\mathbb{E} \sup_{v \in V} X_v$ by Slepian's lemma. Note that, for any $v_1, v_2 \in V$,

\[ \mathbb{E} |X_{v_1} - X_{v_2}|^2 = \| A (v_1 - v_2) \|_2^2, \]

and hence an upper bound on $\mathbb{E} \sup_{v \in V} X_v$ would depend on the largest singular value of $A$. The largest singular value of $A$, however, cannot be bounded above by a constant independent of $n$ under the high-dimensional setting. We can weaken the requirement on $A$ to a restricted smoothness condition such as

\[ \|Av\|_2 \leq \sqrt{1 + \epsilon} \, \|v\|_2, \quad \text{for all } v \in \overline{\mathcal{F}}_g(\theta^\natural) \cap S^{p-1}, \]

which, by Theorem E.1, holds with high probability. This condition, however, does not imply

\[ \| A (v_1 - v_2) \|_2 \leq C \| v_1 - v_2 \|_2 \]

for some dimension-independent constant $C > 0$, for all $v_1, v_2 \in V$.


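C Proof of Lemma 5.1

Let $e := \hat\theta - \theta^\natural$. If $e \in \mathcal{F}_g(\theta^\natural) \setminus tB$, following the proof of Theorem 4.1, we obtain

\[ \|e\|_2 \leq \frac{1}{\mu} \left\| \Pi_{\overline{\mathcal{F}_g(\theta^\natural) \setminus tB}} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2, \]

where $\overline{\mathcal{F}_g(\theta^\natural) \setminus tB}$ denotes the conic hull of $\mathcal{F}_g(\theta^\natural) \setminus tB$. If $e \in tB$, we have the naive bound: $\|e\|_2 \leq t$. Therefore,

\[ \|e\|_2 \leq \max \left\{ t, \frac{1}{\mu} \left\| \Pi_{\overline{\mathcal{F}_g(\theta^\natural) \setminus tB}} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2 \right\} \leq t + \frac{1}{\mu} \left\| \Pi_{\overline{\mathcal{F}_g(\theta^\natural) \setminus tB}} \left( -\nabla f_n(\theta^\natural) \right) \right\|_2. \]

The lemma follows by taking expectations on both sides.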


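D Proof of Corollary 5.2

Let $e := \hat\theta - \theta^\natural$. If $e \in \mathcal{F}_g(\theta^\natural) \setminus tB$, following the proof of Theorem 4.1, we can obtain

\[ \mathbb{E} \|e\|_2 \leq 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\mu \sqrt{n}}; \]

otherwise, we can bound the expected estimation error from above by $t$. Therefore,

\[ \mathbb{E} \|e\|_2 \leq \max \left\{ t, \; 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\mu \sqrt{n}} \right\} \leq t + 2\sqrt{2\pi} \, \sigma_{\max} \, \frac{\omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\mu \sqrt{n}}. \]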
E Proof of Corollary 6.1 and Corollary 6.2

The proofs in this section rely on the following theorem [19].

Theorem E.1 ([19]). Let $T \subseteq \mathbb{R}^p$ be star-shaped. Let $A \in \mathbb{R}^{n \times p}$, $n < p$, be a matrix whose rows are i.i.d. isotropic subgaussian random vectors with subgaussian norm $\alpha \geq 1$, and let $\epsilon \in (0, 1)$. Then there exist constants $c_1$ and $c_2$ such that for all $x \in T$ satisfying

\[ \|x\|_2 \geq t_* := \inf \left\{ t > 0 : t \geq \frac{c_1 \alpha^2 \omega_t(T)}{\epsilon \sqrt{n}} \right\}, \tag{22} \]

we have

\[ (1 - \epsilon) \|x\|_2^2 \leq \frac{\|Ax\|_2^2}{n} \leq (1 + \epsilon) \|x\|_2^2 \]

with probability at least $1 - \exp(-c_2 \epsilon^2 n)$.

We note that the sub-Gaussian norm of a vector of i.i.d. standard Gaussian entries or i.i.d. Rademacher entries is bounded above by a constant m.


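E.1 Proof of Corollary 6.1

We prove the corollary by Corollary 5.2.

Let $A$ be defined as in Theorem 4.1. We verify the condition (14) by Theorem E.1. Since $\omega_t\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right) = t \, \omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)$, the condition (22) is equivalent to requiring

\[ \sqrt{n} \geq \frac{c_1 \alpha^2 \, \omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\epsilon}. \]

Once this inequality is satisfied, we can set $\mu = 1 - \epsilon$, and the condition (14) holds with probability at least $1 - \exp(-c_2 \epsilon^2 n / \alpha^4)$. Note that $\sigma_{\max} = \sqrt{\mathbb{E} (\sigma w_i)^2} = \sigma$. This completes the proof.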


E.2 Proof of Corollary 6.2

We prove the corollary by Corollary 5.2.

It is known that

\[ \nabla^2 f_n(\theta) = \frac{1}{n} A^{\mathsf T} D(\theta) A \]

for the ML estimator in a canonical GLM, where $A$ is defined as in Theorem 4.1, and $D(\theta)$ is a diagonal matrix; furthermore, there exists a continuous strictly positive function $\phi$ such that the $(i, i)$-th entry of $D(\theta)$ is given by $\phi(\langle a_i, \theta \rangle)$. Since the entries of $A$ are i.i.d. Rademacher random variables, for any $\theta \in \mathcal{G}$,

\[ |\langle a_i, \theta \rangle| \leq \|a_i\|_\infty \|\theta\|_1 \leq c. \]

By the extreme value theorem, the diagonal entries of $D(\theta)$ are bounded below by a constant $\nu > 0$ for all $\theta \in \mathcal{G}$, which is independent of $n$. Similarly, $\sigma_{\max}$ is bounded above by a constant independent of $n$.

The rest of the proof is similar to the last paragraph in the previous sub-section. By Theorem E.1, if we choose $n$ such that

\[ \sqrt{n} \geq \frac{c_1 \alpha^2 \, \omega_1\left( \overline{\mathcal{F}_g(\theta^\natural) \setminus tB} \right)}{\epsilon}, \]

then the condition (14) holds with probability at least $1 - \exp(-c_2 \epsilon^2 n / \alpha^4)$ with $\mu = \nu (1 - \epsilon)$.